{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "source": [
    "选择性能估计的方法通常取决于建模的数据。例如，当我们可以假设观察数据之间存在独立性和相同分布（i.i.d.）时，**交叉验证**通常是最合适的方法[1]。\n",
    "\n",
    "在一般的机器学习建模过程中，我们通常采用如图（a）一样的K折交叉验证方法来评估模型性能或者调整模型的超参数。然而，时间序列数据存在时序自相关性，所以测试数据是不能出现在训练数据之前的，否则将会造成数据泄露！\n",
    "\n",
    "因此，我们在HyperTS中内置了三种交叉验证策略来应对时间序列预测任务的模型评估。具体如下：\n",
    "\n",
    "- 1. 如图(b)所示 *preq-bls*: 数据被分割成n个块。在初始迭代，只有前两个块被使用，第一个块作为训练集，第二个块作为验证集。在接下来的迭代中，第二个块合并进第一个块一起作为训练集，第三个块作为验证集。这个过程一直将所有的块用完为止。\n",
    "\n",
    "- 2. 如图(c)所示 *preq-slid-bls*: 无需在每次迭代后合并块（增长窗口），以滑动窗口的方式忘记较旧的块。当过去的数据被弃用时，通常会采用这种想法，这在非静态环境中比较常见。\n",
    "\n",
    "- 3. 如图(d)所示 *preq-bls-gap*: 在训练集块和验证集块之间引入了间隙块。这个想法背后的基本原理是增加训练集和测试集之间的独立性。\n",
    "\n",
    "对于时间序列分类和回归任务，因为是它们是在样本层次来分割训练集和验证集的，故不涉及数据泄露。我们采用了如图(a)所示的交叉验证方式。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "![cv](../images/cv.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "参考文献\n",
    "\n",
    "[1] Cerqueira V, Torgo L, Mozetič I. Evaluating time series forecasting models: An empirical study on performance estimation methods[J]. Machine Learning, 2020, 109(11): 1997-2028."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**使用示例:**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 1. 准备数据"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from hyperts.datasets import load_network_traffic\n",
    "from sklearn.model_selection import train_test_split"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = load_network_traffic(univariate=True)\n",
    "train_data, test_data = train_test_split(df, test_size=168, shuffle=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2. 创建实验并运行它"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "from hyperts import make_experiment"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "设置参数 ``cv=True``, ``num_folds`` 控制折数。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    }
   ],
   "source": [
    "experiment = make_experiment(train_data=train_data.copy(),\n",
    "                             task='forecast',\n",
    "                             mode='dl',\n",
    "                             timestamp='TimeStamp',\n",
    "                             covariates=['HourSin', 'WeekCos', 'CBWD'],\n",
    "                             forecast_train_data_periods=24*12,\n",
    "                             max_trials=5,\n",
    "                             cv=True,\n",
    "                             num_folds=3)\n",
    "model = experiment.run()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<bound method Pipeline.get_params of Pipeline(steps=[('data_preprocessing',\n",
       "                 TSFDataPreprocessStep(covariate_cleaner=CovariateTransformer(covariables=['HourSin',\n",
       "                                                                                           'WeekCos',\n",
       "                                                                                           'CBWD'],\n",
       "                                                                              data_cleaner_args={'correct_object_dtype': False,\n",
       "                                                                                                 'int_convert_to': 'str'}),\n",
       "                                       covariate_cleaner__covariables=['HourSin',\n",
       "                                                                       'WeekCos',\n",
       "                                                                       'CBWD'],\n",
       "                                       covariate_cleaner__data_cleaner_args={'correct_object_dtype': False,\n",
       "                                                                             'int_convert_to': 'str'},\n",
       "                                       covariate_cols=['HourSin', 'WeekCos',\n",
       "                                                       'CBWD'],\n",
       "                                       cv=True, freq='H',\n",
       "                                       name='data_preprocessing',\n",
       "                                       timestamp_col=['TimeStamp'],\n",
       "                                       train_data_periods=288)),\n",
       "                ('estimator',\n",
       "                 <hyperts.hyper_ts.HyperTSEstimator object at 0x00000287E696A550>)])>"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model.get_params"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 3. 推理与评估"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Metirc</th>\n",
       "      <th>Score</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>mae</td>\n",
       "      <td>0.5523</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>mse</td>\n",
       "      <td>0.4975</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>rmse</td>\n",
       "      <td>0.7053</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>mape</td>\n",
       "      <td>0.3192</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>smape</td>\n",
       "      <td>0.2533</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  Metirc  Score\n",
       "0    mae 0.5523\n",
       "1    mse 0.4975\n",
       "2   rmse 0.7053\n",
       "3   mape 0.3192\n",
       "4  smape 0.2533"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X_test, y_test = model.split_X_y(test_data.copy())\n",
    "forecast = model.predict(X_test)\n",
    "results = model.evaluate(y_true=y_test, y_pred=forecast)\n",
    "results"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}